WikiData Import 2022-05-21

❌ This attempt failed: the Wikidata index build aborted with a "Too many open files" error - see https://github.com/ad-freiburg/qlever-control/issues/4.

See QLever/script, as discussed in QLever Issue #562, for the script that makes this attempt easier to reproduce. This trial uses the "old" script only for the environment check, download and clone.

QLever trial

Requirements: >= 64 GB RAM, a docker environment (e.g. Ubuntu) and > 1 TB of disk space (SSD preferred for speed).

./qlever -v -e
qlever version : 1.27 $ : 2022/03/16 08:54:18 $
needed software
docker → /usr/bin/docker ✅
top → /usr/bin/top ✅
df → /usr/bin/df ✅
jq → /usr/bin/jq ✅
lsb_release → /usr/bin/lsb_release ✅
free → /usr/bin/free ✅
operating system
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal
docker version
Docker version 20.10.13, build a224086
memory
              total        used        free      shared  buff/cache   available
Mem:          125Gi       1,1Gi       121Gi        31Mi       2,9Gi       123Gi
Swap:         2,0Gi          0B       2,0Gi
diskspace
/dev/sdb5       116G   23G   88G  21% /
tmpfs            63G     0   63G   0% /dev/shm
/dev/sda1       3,6T  987G  2,5T  29% /hd/seel
/dev/sdb1       511M  4,0K  511M   1% /boot/efi
soft ulimit for files
1048576

QLever clone

./qlever -c
cloning qlever - please wait typically 1 min ...
cloning qlever started at Sa 21. Mai 08:33:35 CEST 2022
Cloning into 'qlever-code'...
remote: Enumerating objects: 13828, done.
remote: Counting objects: 100% (973/973), done.
remote: Compressing objects: 100% (705/705), done.
remote: Total 13828 (delta 574), reused 451 (delta 267), pack-reused 12855
Receiving objects: 100% (13828/13828), 111.72 MiB | 6.86 MiB/s, done.
Resolving deltas: 100% (10707/10707), done.
Submodule 'third_party/abseil-cpp' (https://github.com/abseil/abseil-cpp.git) registered for path 'third_party/abseil-cpp'
Submodule 'third_party/antlr4' (https://github.com/antlr/antlr4.git) registered for path 'third_party/antlr4'
Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest'
Submodule 'third_party/re2' (https://github.com/google/re2.git) registered for path 'third_party/re2'
Submodule 'third_party/stxxl' (https://github.com/ad-freiburg/stxxl) registered for path 'third_party/stxxl'
Cloning into '/hd/seel/qlever/qlever-code/third_party/abseil-cpp'...
remote: Enumerating objects: 16841, done.        
remote: Counting objects: 100% (149/149), done.        
remote: Compressing objects: 100% (78/78), done.        
remote: Total 16841 (delta 83), reused 112 (delta 71), pack-reused 16692        
Receiving objects: 100% (16841/16841), 10.55 MiB | 6.78 MiB/s, done.
Resolving deltas: 100% (13078/13078), done.
Cloning into '/hd/seel/qlever/qlever-code/third_party/antlr4'...
remote: Enumerating objects: 128025, done.        
remote: Counting objects: 100% (13/13), done.        
remote: Compressing objects: 100% (11/11), done.        
remote: Total 128025 (delta 3), reused 3 (delta 1), pack-reused 128012        
Receiving objects: 100% (128025/128025), 65.33 MiB | 6.76 MiB/s, done.
Resolving deltas: 100% (75484/75484), done.
Cloning into '/hd/seel/qlever/qlever-code/third_party/googletest'...
remote: Enumerating objects: 24402, done.        
remote: Counting objects: 100% (67/67), done.        
remote: Compressing objects: 100% (32/32), done.        
remote: Total 24402 (delta 31), reused 53 (delta 28), pack-reused 24335        
Receiving objects: 100% (24402/24402), 10.27 MiB | 6.87 MiB/s, done.
Resolving deltas: 100% (18049/18049), done.
Cloning into '/hd/seel/qlever/qlever-code/third_party/re2'...
remote: Enumerating objects: 7130, done.        
remote: Counting objects: 100% (961/961), done.        
remote: Compressing objects: 100% (86/86), done.        
remote: Total 7130 (delta 891), reused 878 (delta 875), pack-reused 6169        
Receiving objects: 100% (7130/7130), 3.18 MiB | 6.86 MiB/s, done.
Resolving deltas: 100% (5485/5485), done.
Cloning into '/hd/seel/qlever/qlever-code/third_party/stxxl'...
remote: Enumerating objects: 40997, done.        
remote: Counting objects: 100% (60/60), done.        
remote: Compressing objects: 100% (40/40), done.        
remote: Total 40997 (delta 22), reused 39 (delta 12), pack-reused 40937        
Receiving objects: 100% (40997/40997), 14.15 MiB | 6.80 MiB/s, done.
Resolving deltas: 100% (30921/30921), done.
Submodule path 'third_party/abseil-cpp': checked out 'b9b925341f9e90f5e7aa0cf23f036c29c7e454eb'
Submodule path 'third_party/antlr4': checked out 'e4c1a74c66bd5290364ea2b36c97cd724b247357'
Submodule path 'third_party/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'third_party/re2': checked out '13ebb377c6ad763ca61d12dd6f88b1126bd0b911'
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 215 bytes | 215.00 KiB/s, done.
From https://github.com/ad-freiburg/stxxl
 * branch              4f368a8eacc965a775f208df0c2d3a0721f4bdf1 -> FETCH_HEAD
Submodule path 'third_party/stxxl': checked out '4f368a8eacc965a775f208df0c2d3a0721f4bdf1'
Submodule 'extlib/foxxll' (https://github.com/ad-freiburg/foxxll.git) registered for path 'third_party/stxxl/extlib/foxxll'
Cloning into '/hd/seel/qlever/qlever-code/third_party/stxxl/extlib/foxxll'...
remote: Enumerating objects: 21414, done.        
remote: Counting objects: 100% (28/28), done.        
remote: Compressing objects: 100% (22/22), done.        
remote: Total 21414 (delta 9), reused 13 (delta 4), pack-reused 21386        
Receiving objects: 100% (21414/21414), 4.60 MiB | 2.12 MiB/s, done.
Resolving deltas: 100% (15789/15789), done.
Submodule path 'third_party/stxxl/extlib/foxxll': checked out '8cbca7bedcdb0b84a6de99e927c5fa27a4bbbfb2'
Submodule 'extlib/tlx' (https://github.com/joka921/tlx.git) registered for path 'third_party/stxxl/extlib/foxxll/extlib/tlx'
Cloning into '/hd/seel/qlever/qlever-code/third_party/stxxl/extlib/foxxll/extlib/tlx'...
remote: Enumerating objects: 3418, done.        
remote: Counting objects: 100% (53/53), done.        
remote: Compressing objects: 100% (33/33), done.        
remote: Total 3418 (delta 25), reused 39 (delta 20), pack-reused 3365        
Receiving objects: 100% (3418/3418), 1.11 MiB | 6.59 MiB/s, done.
Resolving deltas: 100% (2612/2612), done.
Submodule path 'third_party/stxxl/extlib/foxxll/extlib/tlx': checked out 'ef81a598d9880cc7d242afc47de7328634f07f1d'
cloning qlever finished at Sa 21. Mai 08:34:20 CEST 2022 after 45 seconds

Wikidata dump download

./qlever --wikidata_download
qlever-indices/wikidata already exists
wikidata.settings.json already copied to qlever-indices/wikidata
downloading wikidata lexemes:latest-lexemes.ttl.bz2 ... please wait typically 3min ...
wikidata lexemes download started at Sa 21. Mai 08:38:56 CEST 2022
--2022-05-21 08:38:56--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620:0:861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620:0:861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 327629685 (312M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.bz2’

latest-lexemes.ttl. 100%[===================>] 312,45M  4,17MB/s    in 77s     

2022-05-21 08:40:14 (4,08 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [327629685/327629685]

wikidata lexemes download finished at Sa 21. Mai 08:40:14 CEST 2022 after 78 seconds
downloading wikidata dump:latest-all.ttl.bz2 ... please wait typically 6hours ...
wikidata dump download started at Sa 21. Mai 08:42:59 CEST 2022
92434800K .......... ....                                       100% 86,4M=15m0s

2022-05-21 14:30:17 (4,49 MB/s) - ‘latest-all.ttl.bz2’ saved [94653250500/94653250500]

wikidata dump download finished at Sa 21. Mai 14:30:17 CEST 2022 after 20838 seconds

Build code

See WikiData_Import_2022-03-16#Native_approach for the preparation steps, then follow the steps in https://github.com/ad-freiburg/qlever/blob/master/Dockerfiles/Dockerfile.Ubuntu20.04

wf@sun:/hd/seel/qlever/qlever-code$ mkdir build
wf@sun:/hd/seel/qlever/qlever-code$ cd build
wf@sun:/hd/seel/qlever/qlever-code/build$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER="g++-11" -DLOGLEVEL=INFO -DUSE_PARALLEL=true -GNinja ..
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 11.1.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/g++-11
-- Check for working CXX compiler: /usr/bin/g++-11 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test HAS_COROUTINES
-- Performing Test HAS_COROUTINES - Success
-- Building without demo. To enable demo build use: -DWITH_DEMO=True
CMake Deprecation Warning at third_party/antlr4/runtime/Cpp/CMakeLists.txt:31 (CMAKE_POLICY):
  The OLD behavior for policy CMP0054 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.


CMake Deprecation Warning at third_party/antlr4/runtime/Cpp/CMakeLists.txt:32 (CMAKE_POLICY):
  The OLD behavior for policy CMP0045 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.


CMake Deprecation Warning at third_party/antlr4/runtime/Cpp/CMakeLists.txt:33 (CMAKE_POLICY):
  The OLD behavior for policy CMP0042 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.


CMake Deprecation Warning at third_party/antlr4/runtime/Cpp/CMakeLists.txt:38 (CMAKE_POLICY):
  The OLD behavior for policy CMP0059 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.


CMake Deprecation Warning at third_party/antlr4/runtime/Cpp/CMakeLists.txt:39 (CMAKE_POLICY):
  The OLD behavior for policy CMP0054 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.


-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") 
-- Checking for module 'uuid'
--   Found uuid, version 2.34.0
-- Output libraries to /hd/seel/qlever/qlever-code/dist
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found the following ICU libraries:
--   uc (required)
--   i18n (required)
-- Found ICU: /usr/include (found suitable version "66.1", minimum required is "60") 
-- Checking for module 'jemalloc'
--   Found jemalloc, version 5.2.1_0
CMake Warning at /usr/share/cmake-3.16/Modules/FindBoost.cmake:1161 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindBoost.cmake:1283 (_Boost_COMPONENT_DEPENDENCIES)
  /usr/share/cmake-3.16/Modules/FindBoost.cmake:1921 (_Boost_MISSING_DEPENDENCIES)
  CMakeLists.txt:88 (find_package)


CMake Warning at /usr/share/cmake-3.16/Modules/FindBoost.cmake:1161 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindBoost.cmake:1283 (_Boost_COMPONENT_DEPENDENCIES)
  /usr/share/cmake-3.16/Modules/FindBoost.cmake:1921 (_Boost_MISSING_DEPENDENCIES)
  CMakeLists.txt:88 (find_package)


CMake Warning at /usr/share/cmake-3.16/Modules/FindBoost.cmake:1161 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindBoost.cmake:1283 (_Boost_COMPONENT_DEPENDENCIES)
  /usr/share/cmake-3.16/Modules/FindBoost.cmake:1921 (_Boost_MISSING_DEPENDENCIES)
  CMakeLists.txt:88 (find_package)


-- Found Boost: /usr/include (found suitable version "1.74.0", minimum required is "1.74") found components: iostreams program_options regex 
-- Found Python: /usr/bin/python3.8 (found version "3.8.10") found components: Interpreter 
CMake Warning at third_party/abseil-cpp/CMakeLists.txt:74 (message):
  A future Abseil release will default ABSL_PROPAGATE_CXX_STD to ON for CMake
  3.8 and up.  We recommend enabling this option to ensure your project still
  builds correctly.


-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Found Git: /usr/bin/git (found version "2.25.1") 
-- Detected git refspec 1.4.1-825-g4f368a8e sha 4f368a8eacc965a775f208df0c2d3a0721f4bdf1
-- Performing Test CXX_HAS_FLAGS_WEXTRA
-- Performing Test CXX_HAS_FLAGS_WEXTRA - Success
-- Performing Test CXX_HAS_TEMPLATE_DEPTH
-- Performing Test CXX_HAS_TEMPLATE_DEPTH - Success
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test HAVE_STD_MUTEX
-- Performing Test HAVE_STD_MUTEX - Success
-- Performing Test HAVE_STD_THREAD
-- Performing Test HAVE_STD_THREAD - Success
-- Looking for C++ include random
-- Looking for C++ include random - found
-- Checking for 64-bit off_t
-- Checking for 64-bit off_t - present
-- Checking for fseeko/ftello
-- Checking for fseeko/ftello - present
-- Performing Test STXXL_HAVE_O_DIRECT
-- Performing Test STXXL_HAVE_O_DIRECT - Success
-- Looking for mmap
-- Looking for mmap - found
-- Performing Test STXXL_HAVE_LINUXAIO_FILE
-- Performing Test STXXL_HAVE_LINUXAIO_FILE - Success
-- Performing Test STXXL_HAVE_SYNC_ADD_AND_FETCH
-- Performing Test STXXL_HAVE_SYNC_ADD_AND_FETCH - Success
-- OpenMP disabled in STXXL (no parallelism is used).
-- Looking for mallinfo
-- Looking for mallinfo - found
-- Looking for mlock
-- Looking for mlock - found
-- Detected git refspec 1.4.1-460-g8cbca7be sha 8cbca7bedcdb0b84a6de99e927c5fa27a4bbbfb2
-- Performing Test FOXXLL_HAVE_O_DIRECT
-- Performing Test FOXXLL_HAVE_O_DIRECT - Success
-- Looking for mmap
-- Looking for mmap - found
-- Performing Test FOXXLL_HAVE_LINUXAIO_FILE
-- Performing Test FOXXLL_HAVE_LINUXAIO_FILE - Success
-- Performing Test TLX_CXX_HAS_CXX17
-- Performing Test TLX_CXX_HAS_CXX17 - Success
-- TLX CMAKE_CXX_FLAGS: -Wshadow -Wold-style-cast -std=c++17 -g -W -Wall -Wextra -fPIC   -Wall -Wextra  -fopenmp -W -Wall -pedantic -Wno-long-long -Wextra -ftemplate-depth=1024 -W -Wall -pedantic -Wno-long-long -Wextra -ftemplate-depth=1024 -Wcast-qual -Winit-self -Wnoexcept -Woverloaded-virtual -Wredundant-decls
-- ---
-- CXX_FLAGS are :   -Wall -Wextra  -fopenmp 
-- CXX_FLAGS_RELEASE are : -O3 -DNDEBUG -O3
-- CXX_FLAGS_DEBUG are : -g
-- IMPORTANT: Make sure you have selected the desired CMAKE_BUILD_TYPE
-- CMAKE_BUILD_TYPE is Release
-- ---
-- Configuring done
-- Generating done
-- Build files have been written to: /hd/seel/qlever/qlever-code/build

Ninja build

See https://ninja-build.org/manual.html

ninja
...
[611/611] Linking CXX executable test/SparqlAntlrParserTest

Install needed packages

sudo apt-get install -y wget python3-yaml unzip curl bzip2 pkg-config libicu-dev python3-icu libgomp1 uuid-runtime
sudo apt install -y lbzip2 libjemalloc-dev libzstd-dev

QLever Control

git clone https://github.com/ad-freiburg/qlever-control
Cloning into 'qlever-control'...
remote: Enumerating objects: 205, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 205 (delta 8), reused 24 (delta 6), pack-reused 160
Receiving objects: 100% (205/205), 78.51 KiB | 1.48 MiB/s, done.
Resolving deltas: 100% (63/63), done.

QLever Control setup

 . qlever-control/qlever 

Checking your PATH ...

Added the directory "/hd/seel/qlever/qlever-control" to your PATH
Added the directory "/local/data/qlever/qlever-code/build" to your PATH

Setting up bash autocompletion ...

Done, the following completions are now available:

autocompletion-warmup cache-stats cat-files clear-cache clear-cache-complete 
disk-usage docker-off docker-on download-data help help-install index 
index-stats log log-until-server-up memory-usage pin-INTERNAL rdf-files 
remove-data remove-index restart server-settings start status stop 
text-input-from-nt-literals ui update wait where

Creating new Qleverfile ...

There was no "Qleverfile" in this directory yet, so I created one for you. A
Qleverfile contains basic configuration telling the "qlever" command how to do
certain things. Please check and modify as you see fit.

Checking .settings.json file ...

The index builder currently also requires a file "must_specify.settings.json", and I
created an empty one for you. In this file, you can specify specialized settings
for the indexer, for example: prefixes of IRIs for which the names are stored on
disk (to save RAM), the locale by which literals are sorted, the batch size used
by the indexer (default: 10M), and whether the input is well behaved in a
certain way (which enables faster indexing). See the QLever Wiki for more
information.

Setup is complete

Type "qlever" and use autocompletion to see which actions are available. Add a
"show" in the end to see what an action does without executing it (for example,
"qlever index show"). Typing "qlever" without arguments gives some basic help
and pointers for further help. Edit your local "Qleverfile" to change settings.

Olympics trial

wget https://github.com/wallscope/olympics-rdf/raw/master/data/olympics-nt-nodup.zip
unzip olympics-nt-nodup.zip
ls -l olympics.nt 
-rw-r--r-- 1 wf wf 337695911 Nov  1  2018 olympics.nt

Indexing

See https://github.com/ad-freiburg/qlever-control/issues/2 for a workaround that avoids having to modify PATH.

cat olympics.nt | /hd/seel/qlever/qlever-code/build/IndexBuilderMain -F ttl -K olympics -f - -i olympics -s olympics.settings.json | tee olympics.index-log.txt
2022-05-21 09:26:34.212	- INFO:  QLever IndexBuilder, compiled on May 21 2022 08:50:52
2022-05-21 09:26:34.212	- INFO:  You specified the input format: TTL
2022-05-21 09:26:34.212	- INFO:  Locale was not specified in settings file, default is en_US
2022-05-21 09:26:34.212	- INFO:  You specified "locale = en_US" and "ignore-punctuation = 0"
2022-05-21 09:26:34.213	- INFO:  You specified "num-triples-per-batch = 10,000,000", choose a lower value if the index builder runs out of memory
2022-05-21 09:26:34.213	- INFO:  Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2022-05-21 09:26:34.213	- INFO:  Processing input triples from /dev/stdin ...
2022-05-21 09:26:42.491	- INFO:  Done, total number of triples read: 1,781,625 [may contain duplicates]
2022-05-21 09:26:42.491	- INFO:  Number of QLever-internal triples created: 274,052 [may contain duplicates]
2022-05-21 09:26:42.491	- INFO:  Merging partial vocabularies in byte order (internal only) ...
2022-05-21 09:26:43.016	- INFO:  Number of words in internal vocabulary: 543,750
2022-05-21 09:26:43.016	- INFO:  Merging partial vocabularies in Unicode order (internal and external) ...
2022-05-21 09:26:43.309	- INFO:  Number of words in external vocabulary: 0
2022-05-21 09:26:43.309	- INFO:  Removing temporary files ...
2022-05-21 09:26:43.325	- INFO:  Converting external vocabulary to binary format ...
2022-05-21 09:26:43.325	- INFO:  Converting triples from local IDs to global IDs ...
2022-05-21 09:26:44.105	- INFO:  Done, total number of triples converted: 2,055,677
2022-05-21 09:26:44.108	- INFO:  Building prefix tree from internal vocabulary ...
2022-05-21 09:26:44.419	- INFO:  Computing maximally compressing prefixes (greedy algorithm) ...
2022-05-21 09:26:45.355	- INFO:  Reduction of size of internal vocabulary: 61%
2022-05-21 09:26:45.383	- INFO:  Writing compressed vocabulary to disk ...
2022-05-21 09:26:47.199	- INFO:  Creating a pair of index permutations ... 
2022-05-21 09:26:47.898	- INFO:  Statistics for PSO: #relations = 18, #blocks = 3, #triples = 2,055,674
2022-05-21 09:26:47.898	- INFO:  Statistics for POS: #relations = 18, #blocks = 3, #triples = 2,055,674
2022-05-21 09:26:47.898	- INFO:  Exchanging multiplicities for PSO and POS ...
2022-05-21 09:26:47.898	- INFO:  Writing meta data for PSO and POS ...
2022-05-21 09:26:49.557	- INFO:  Creating a pair of index permutations ... 
2022-05-21 09:26:50.153	- INFO:  Statistics for SPO: #relations = 543,723, #blocks = 3, #triples = 2,055,674
2022-05-21 09:26:50.153	- INFO:  Statistics for SOP: #relations = 543,723, #blocks = 3, #triples = 2,055,674
2022-05-21 09:26:50.153	- INFO:  Exchanging multiplicities for SPO and SOP ...
2022-05-21 09:26:50.251	- INFO:  Writing meta data for SPO and SOP ...
2022-05-21 09:26:50.253	- INFO:  Number of distinct patterns: 16
2022-05-21 09:26:50.253	- INFO:  Number of subjects with pattern: 543,723 [all]
2022-05-21 09:26:50.253	- INFO:  Total number of distinct subject-predicate pairs: 1,859,773
2022-05-21 09:26:50.253	- INFO:  Average number of predicates per subject: 3.4
2022-05-21 09:26:50.253	- INFO:  Average number of subjects per predicate: 109,398
2022-05-21 09:26:51.452	- INFO:  Creating a pair of index permutations ... 
2022-05-21 09:26:51.995	- INFO:  Statistics for OSP: #relations = 274,301, #blocks = 3, #triples = 2,055,674
2022-05-21 09:26:51.995	- INFO:  Statistics for OPS: #relations = 274,301, #blocks = 3, #triples = 2,055,674
2022-05-21 09:26:51.995	- INFO:  Exchanging multiplicities for OSP and OPS ...
2022-05-21 09:26:52.042	- INFO:  Writing meta data for OSP and OPS ...
2022-05-21 09:26:52.043	- INFO:  Index build completed

dblp trial

wget https://dblp.org/rdf/release/dblp-2022-05-02.nt.gz
--2022-05-21 12:01:06--  https://dblp.org/rdf/release/dblp-2022-05-02.nt.gz
gunzip dblp-2022-05-02.nt.gz 
wf@sun:/hd/seel/qlever/dblp$ ls -l
-rw-rw-r-- 1 wf wf 33687525594 Mai  3 00:16 dblp-2022-05-02.nt

qlever index

qlever index

This is the "qlever" script, call without argument for help

Executing "index":

cat dblp-2022-05-02.nt | IndexBuilderMain -F ttl -K dblp -f - -i dblp -s dblp.settings.json | tee dblp.index-log.txt

2022-05-22 17:56:59.354	- INFO:  QLever IndexBuilder, compiled on May 21 2022 08:50:52
2022-05-22 17:56:59.354	- INFO:  You specified the input format: TTL
2022-05-22 17:56:59.354	- INFO:  Locale was not specified in settings file, default is en_US
2022-05-22 17:56:59.354	- INFO:  You specified "locale = en_US" and "ignore-punctuation = 0"
2022-05-22 17:56:59.354	- INFO:  You specified "num-triples-per-batch = 10,000,000", choose a lower value if the index builder runs out of memory
2022-05-22 17:56:59.354	- INFO:  Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2022-05-22 17:56:59.354	- INFO:  Processing input triples from /dev/stdin ...
2022-05-22 17:59:12.063	- INFO:  Input triples processed: 100,000,000
2022-05-22 18:01:17.261	- INFO:  Input triples processed: 200,000,000
2022-05-22 18:02:38.857	- INFO:  Done, total number of triples read: 256,502,128 [may contain duplicates]
2022-05-22 18:02:38.857	- INFO:  Number of QLever-internal triples created: 288 [may contain duplicates]
2022-05-22 18:02:38.857	- INFO:  Merging partial vocabularies in byte order (internal only) ...
2022-05-22 18:03:46.727	- INFO:  Number of words in internal vocabulary: 87,377,190
2022-05-22 18:03:46.727	- INFO:  Merging partial vocabularies in Unicode order (internal and external) ...
2022-05-22 18:05:40.624	- INFO:  Number of words in external vocabulary: 0
2022-05-22 18:05:40.624	- INFO:  Removing temporary files ...
2022-05-22 18:05:42.627	- INFO:  Converting external vocabulary to binary format ...
2022-05-22 18:05:42.628	- INFO:  Converting triples from local IDs to global IDs ...
2022-05-22 18:06:14.748	- INFO:  Triples converted: 100,000,000
2022-05-22 18:06:56.179	- INFO:  Triples converted: 200,000,000
2022-05-22 18:07:14.548	- INFO:  Done, total number of triples converted: 256,502,416
2022-05-22 18:07:14.556	- INFO:  Building prefix tree from internal vocabulary ...
2022-05-22 18:07:48.496	- INFO:  Computing maximally compressing prefixes (greedy algorithm) ...
2022-05-22 18:09:54.929	- INFO:  Reduction of size of internal vocabulary: 23%
2022-05-22 18:09:58.222	- INFO:  Writing compressed vocabulary to disk ...
2022-05-22 18:11:48.809	- INFO:  Creating a pair of index permutations ... 
2022-05-22 18:13:45.741	- INFO:  Statistics for PSO: #relations = 65, #blocks = 521, #triples = 256,475,227
2022-05-22 18:13:45.741	- INFO:  Statistics for POS: #relations = 65, #blocks = 521, #triples = 256,475,227
2022-05-22 18:13:45.741	- INFO:  Exchanging multiplicities for PSO and POS ...
2022-05-22 18:13:45.741	- INFO:  Writing meta data for PSO and POS ...
2022-05-22 18:13:55.364	- INFO:  Creating a pair of index permutations ... 
2022-05-22 18:15:20.128	- INFO:  Statistics for SPO: #relations = 43,601,997, #blocks = 327, #triples = 256,475,227
2022-05-22 18:15:20.128	- INFO:  Statistics for SOP: #relations = 43,601,997, #blocks = 327, #triples = 256,475,227
2022-05-22 18:15:20.128	- INFO:  Exchanging multiplicities for SPO and SOP ...
2022-05-22 18:15:31.478	- INFO:  Writing meta data for SPO and SOP ...
2022-05-22 18:15:31.728	- INFO:  Number of distinct patterns: 1,275
2022-05-22 18:15:31.728	- INFO:  Number of subjects with pattern: 43,601,997 [all]
2022-05-22 18:15:31.728	- INFO:  Total number of distinct subject-predicate pairs: 222,339,673
2022-05-22 18:15:31.728	- INFO:  Average number of predicates per subject: 5.1
2022-05-22 18:15:31.729	- INFO:  Average number of subjects per predicate: 3,529,201
2022-05-22 18:15:43.918	- INFO:  Creating a pair of index permutations ... 
2022-05-22 18:17:05.826	- INFO:  Statistics for OSP: #relations = 81,329,417, #blocks = 419, #triples = 256,475,227
2022-05-22 18:17:05.826	- INFO:  Statistics for OPS: #relations = 81,329,417, #blocks = 419, #triples = 256,475,227
2022-05-22 18:17:05.826	- INFO:  Exchanging multiplicities for OSP and OPS ...
2022-05-22 18:17:29.011	- INFO:  Writing meta data for OSP and OPS ...
2022-05-22 18:17:29.366	- INFO:  Index build completed
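
The index logs written via tee carry a timestamp on every line, so the total indexing time can be read off the first and last entries. The following is a minimal sketch of that calculation; the function name index_duration is hypothetical, and it assumes GNU date (for the -d option) and a log whose lines start with "YYYY-MM-DD HH:MM:SS" as in the output above.

```shell
#!/usr/bin/env bash
# index_duration LOGFILE
# Prints the number of seconds between the first and the last
# timestamped line of an IndexBuilderMain log (hypothetical helper).
index_duration() {
  local log="$1" first last
  # Keep only lines that start with a date, then cut the first 19
  # characters ("YYYY-MM-DD HH:MM:SS"), dropping the milliseconds.
  first=$(grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} ' "$log" | head -n 1 | cut -c1-19)
  last=$(grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} ' "$log" | tail -n 1 | cut -c1-19)
  # date -d is a GNU coreutils extension; +%s prints seconds since the epoch.
  echo $(( $(date -d "$last" +%s) - $(date -d "$first" +%s) ))
}
```

Called as index_duration dblp.index-log.txt, this should report 1230 seconds for the dblp run above (17:56:59 to 18:17:29, i.e. 20 minutes 30 seconds).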

Wikidata Index

Qleverfile

cat Qleverfile 
# Qleverfile for folder /hd/seel/qlever
# Automatically created on Sa 21. Mai 09:09:41 CEST 2022.
# Modify or expand as you see fit.

# Indexer settings
DB               = wikidata 
RDF_FILES        = "latest-all.ttl.bz2 latest-lexemes.ttl.bz2"
CAT_FILES        = "bzcat ${RDF_FILES}"
WITH_TEXT        = false
SETTINGS_JSON    = '{ "num-triples-per-batch": 10000000 }'

# Server settings
HOSTNAME                       = sun.bitplan.com
SERVER_PORT                    = 7001
MEMORY_FOR_QUERIES             = 10
CACHE_MAX_SIZE_GB              = 5
CACHE_MAX_SIZE_GB_SINGLE_ENTRY = 1
CACHE_MAX_NUM_ENTRIES          = 100

# QLever binaries
QLEVER_BIN_DIR          = /hd/seel/qlever/qlever-code/build/ 
USE_DOCKER              = false
QLEVER_DOCKER_IMAGE     = adfreiburg/qlever
QLEVER_DOCKER_CONTAINER = qlever.must_specify

# QLever UI
QLEVERUI_PORT   = 7000
QLEVERUI_DIR    = qlever-ui
QLEVERUI_CONFIG = default
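
For orientation, the indexer settings above presumably expand to a pipeline of the same shape as the dblp command printed by "qlever index". The sketch below only assembles and prints that command as a string; it is an assumption derived by analogy, not output of the qlever script, and it uses plain bash assignments rather than the Qleverfile syntax.

```shell
#!/usr/bin/env bash
# Values copied from the Qleverfile above.
DB="wikidata"
RDF_FILES="latest-all.ttl.bz2 latest-lexemes.ttl.bz2"
CAT_FILES="bzcat ${RDF_FILES}"

# Assumption: same command shape as the dblp run shown by "qlever index"
# (cat replaced by CAT_FILES, dblp replaced by DB).
index_cmd="${CAT_FILES} | IndexBuilderMain -F ttl -K ${DB} -f - -i ${DB} -s ${DB}.settings.json | tee ${DB}.index-log.txt"

# Print only; nothing is executed here.
echo "${index_cmd}"
```

According to the setup message, "qlever index show" prints what the action would do without executing it, which is the authoritative way to check the real command.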

Progress

nohup qlever index&
tail -f nohup.out
2022-05-22 17:48:22.562	- INFO:  QLever IndexBuilder, compiled on May 21 2022 08:50:52
2022-05-22 17:48:22.562	- INFO:  You specified the input format: TTL
2022-05-22 17:48:22.564	- INFO:  Locale was not specified in settings file, default is en_US
2022-05-22 17:48:22.564	- INFO:  You specified "locale = en_US" and "ignore-punctuation = 0"
2022-05-22 17:48:22.564	- INFO:  You specified "num-triples-per-batch = 10,000,000", choose a lower value if the index builder runs out of memory
2022-05-22 17:48:22.564	- INFO:  Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2022-05-22 17:48:22.564	- INFO:  Processing input triples from /dev/stdin ...
2022-05-22 17:50:47.349	- INFO:  Input triples processed: 100,000,000
2022-05-22 17:52:56.228	- INFO:  Input triples processed: 200,000,000
2022-05-22 17:54:58.520	- INFO:  Input triples processed: 300,000,000
...
2022-05-23 00:09:50.846	- INFO:  Input triples processed: 17,400,000,000
2022-05-23 00:10:52.613	- INFO:  Done, total number of triples read: 17,460,734,729 [may contain duplicates]
2022-05-23 00:10:52.614	- INFO:  Number of QLever-internal triples created: 10,625,008,810 [may contain duplicates]
2022-05-23 00:10:52.614	- INFO:  Merging partial vocabularies in byte order (internal only) ...
2022-05-23 00:10:53.144	- ERROR: ! ERROR opening file "wikidata.tmp.for-prefix-compression..tmp.partial-vocabulary.1018" with mode "r" (Too many open files)
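
A rough plausibility check of this error, assuming (not confirmed by the source) that the index builder writes one partial vocabulary file per batch and opens all of them at once when merging. The failing file number 1018 would be consistent with a soft limit of 1024 open files, although the environment check at the top reported 1048576; the limit in the shell that actually ran the index may have differed. The sketch compares the estimated number of partial files with the soft limit of the current shell.

```shell
#!/usr/bin/env bash
# Figures copied from the index log above.
triples_read=17460734729      # "total number of triples read"
triples_per_batch=10000000    # "num-triples-per-batch"

# Assumption: one partial vocabulary file per batch (ceiling division).
partial_files=$(( (triples_read + triples_per_batch - 1) / triples_per_batch ))

# Soft limit for open files in the current shell; child processes inherit it.
soft_limit=$(ulimit -Sn)

echo "estimated partial vocabulary files: ${partial_files}"
echo "soft limit for open files:          ${soft_limit}"

if [ "${soft_limit}" != "unlimited" ] && [ "${soft_limit}" -le "${partial_files}" ]; then
  echo "soft limit is too low to keep all partial vocabulary files open at once"
fi
```

With the figures from the log the estimate is 1747 files. Under these assumptions, two things could be tried: raising the soft limit with ulimit -Sn in the same shell before starting nohup qlever index (it cannot exceed the hard limit shown by ulimit -Hn), or increasing "num-triples-per-batch" in SETTINGS_JSON so that fewer partial files are created, which, as the log message warns, costs more memory.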

QleverIndexProgress2022-05-22.jpg